Urban Greenspace and High Blood Pressure Prevalence: Chicago, IL Case Study¶
Introduction¶
Vegetation in urban areas serves many ecological purposes, such as reducing the urban heat island effect, improving air and water quality, and mitigating stormwater. However, studies relating specific human health outcomes to a simple vegetation metric such as average NDVI are often inconclusive. There is strong evidence that reduced access to greenspace negatively affects mortality, heart rate, and violence, while access to greenspace positively influences attention, mood, and physical activity. Research on greenspace access and chronic health conditions such as asthma, diabetes, and high blood pressure is sparser and shows mixed results (Kondo et al., 2018). Additionally, less research examines correlations at finer scales. Landscape metrics such as edge density, mean patch size, and fragmentation can be used to quantify the relationships between greenspaces and human health. This data analysis explores these metrics at the census tract level in relation to high blood pressure rates in Chicago, IL. High blood pressure is a primary indicator of cardiovascular disease, the leading cause of death worldwide. While studies have shown a beneficial effect of greenspace on high blood pressure, the results remain mixed or inconclusive (Zhao et al., 2022). It therefore remains important to explore the relationship between high blood pressure and greenspace, especially through quantitative studies.
Study Area¶
Chicago, IL is the third largest city in the United States, home to more than 2.7 million people; the Chicago metro area has a larger footprint of around 9.5 million. It is a culturally diverse city with a large immigrant population and a racial makeup of roughly 33-36% White, 29% Black or African American, 29-30% Hispanic or Latino, and 6-7% Asian (U.S. Census Bureau, 2024). While the racial composition appears balanced at the city level, Chicago still experiences heavy residential segregation, with specific racial-majority areas across census tracts. Chicago is a highly urbanized city with dense residential communities, transportation infrastructure networks, industrial areas, and urban greenspaces, resulting in significant spatial variation. This wide variation across census tracts makes Chicago a good case study for exploring the relationships between greenspace and high blood pressure.
Methods¶
This data analysis explores the relationship between high blood pressure and urban greenspace through metrics such as edge density, mean patch size, and fragmentation, derived from Normalized Difference Vegetation Index (NDVI) calculations on National Agriculture Imagery Program (NAIP) imagery. NAIP data is collected at a fine spatial resolution of 0.6-meter ground sample distance (GSD), which makes calculating these landscape metrics possible. The temporal resolution of NAIP imagery is coarser than that of most satellite imagery, as flights occur only every few years. For this analysis, fine spatial resolution with coarse temporal resolution is suitable, since spatial detail is what is needed for comparison with the human health data. The NDVI statistics are combined with the 2024 release of the Centers for Disease Control and Prevention (CDC) PLACES dataset, which contains data on a variety of chronic health conditions and diseases, including high blood pressure, at the census tract scale. These datasets are combined to explore the relationship and potential correlation between urban greenspace and high blood pressure. Due to the quantity of data in this analysis, the calculations are performed in the cloud rather than downloading the raster data locally, saving space and memory. The model uses an Ordinary Least Squares (OLS) linear regression, described further below.
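The regression step described above can be sketched as follows. This is a minimal illustration on synthetic data: the column names (`frac_veg`, `edge_density`, `mean_patch_size`, `high_blood_pressure`) are hypothetical stand-ins for the merged tract-level table built later in the analysis, and the response is simulated rather than taken from the CDC data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Simulate tract-level landscape metrics (hypothetical columns)
rng = np.random.default_rng(42)
n = 200
demo_df = pd.DataFrame({
    'frac_veg': rng.uniform(0, 0.6, n),
    'edge_density': rng.uniform(0, 0.3, n),
    'mean_patch_size': rng.uniform(1, 100, n),
})
# Synthetic response: HBP rate loosely decreasing with vegetation fraction
demo_df['high_blood_pressure'] = (
    35 - 20 * demo_df['frac_veg'] + rng.normal(0, 2, n)
)

# Split predictors and response, hold out a test set
X = demo_df[['frac_veg', 'edge_density', 'mean_patch_size']]
y = demo_df['high_blood_pressure']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit OLS and evaluate out-of-sample R^2
model = LinearRegression().fit(X_train, y_train)
r_squared = model.score(X_test, y_test)
```

The same `train_test_split` (and `KFold`, imported below) pattern applies once the real NDVI statistics are joined to the PLACES data.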
STEP 1: Set up analysis¶
### Import libraries
# File paths and organization
import os
import pathlib
# See warnings
import warnings
# Work with different data types (tabular, vector, raster)
import pandas as pd
import geopandas as gpd
import numpy as np
import rioxarray as rxr
import xarray as xr
import rioxarray.merge as rxrmerge
import shapely
import time
# Plotting and visualization packages
import geoviews as gv
import holoviews as hv
import hvplot.pandas
import hvplot.xarray
from cartopy import crs as ccrs
# Data exploration and visualization
import matplotlib
import scipy.stats as stats
import matplotlib.pyplot as plt
# Image processing
from scipy.ndimage import convolve
from scipy.ndimage import label
# Work with data in the cloud
import pystac_client
from sodapy import Socrata
# Use progress bar
from tqdm.notebook import tqdm
# OLS sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
# Create reproducible file paths
data_dir = os.path.join(
    # Home directory
    pathlib.Path.home(),
    # Earth Analytics
    'Documents',
    'education',
    'earth-data-analytics',
    'spring-2026-data',
    # Project directory
    'hbp-urban-green-big-data'
)
# Make the dir once
os.makedirs(data_dir, exist_ok=True)
# Prevent GDAL from quitting due to momentary disruptions
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"
STEP 2: Create a site map¶
### Set up the census tract path
# Define the directory
tract_dir = os.path.join(data_dir, 'chicago-tract')
# Make the directory
os.makedirs(tract_dir, exist_ok=True)
# Make path to directory
tract_path = os.path.join(tract_dir,'chicago-tract.shp')
# Download the census tracts (only once)
if not os.path.exists(tract_path):
    # Add in census url
    tract_url = ('https://data.cdc.gov/download/x7zy-2xmx/application%2Fzip')
    # Read the shp file
    tract_gdf = gpd.read_file(tract_url)
    # Subset to Chicago
    chi_tract_gdf = tract_gdf[tract_gdf.PlaceName == 'Chicago']
    # Save it as a file
    chi_tract_gdf.to_file(tract_path, index=False)
# Load in census tract data
chi_tract_gdf = gpd.read_file(tract_path)
# Show
chi_tract_gdf
# Site plot -- Census tracts with satellite imagery in the background
(
    chi_tract_gdf
    # Set projection
    .to_crs(ccrs.Mercator())
    # Plot with satellite image
    .hvplot(
        line_color='orange',
        fill_color=None,
        crs=ccrs.Mercator(),
        tiles='EsriImagery',
        frame_width=600,
        title="Chicago, IL Census Tract Map (CDC PLACES Dataset - 2024 Release)"
    )
)
Initial Map Reflection¶
Visually, the greenspace does not appear to be evenly distributed. Many tracts either contain a large amount of greenspace or contain very little at all, and many appear nearly uniform in their shade of green or grey within their boundaries. Chicago, like many other cities, has experienced a history of redlining, inequitable planning decisions, and uneven funding distribution across the city (Mitchell, 2025). The greenspace available in different census tracts has been found to correlate with racial majorities and income distribution, with tracts with higher Black and Hispanic populations having less access to and availability of urban greenspace. It has also been shown that White-majority census tracts have better access regardless of income, whereas for Black and Hispanic populations, income inequality compounds disparities in access to urban greenspace (Lui et al., 2021).
Context¶
Redlining, a common practice in the 1930s and 1940s, was when the Home Owners’ Loan Corporation (HOLC) and the Federal Housing Administration (FHA) classified neighborhoods, often based on racial composition, as higher or lower risk. Areas with larger populations of Black and brown Americans were classified as high risk, while White neighborhoods were classified as low risk. This didn't just affect the communities at the time: researchers have shown that the effects continued to shape funding, home values, property values, healthcare access, and more for decades to follow (Schwegman, 2025). This could be evidence as to why we see less greenspace in historically Black or Hispanic neighborhoods, since funding for greenspace conservation and parks comes through investment. Underinvestment over the last 100 years in communities with higher minority populations would lead to less greenspace access, vegetation conservation, and parks in those census tracts.
Many areas and census tracts in Chicago are Black or Hispanic-majority, meaning that many people lack access to urban greenspace, which has been shown to benefit human health outcomes and overall well-being. According to Lui et al. (2021), Black and Hispanic-majority communities experience the greatest disparities in income-driven access to greenspace, so special attention should be paid to Black and Hispanic-majority neighborhoods that are also low income. They are the most disadvantaged group and have historically received less funding due to long-term systemic racial discrimination.
While this data analysis focuses directly on correlations between high blood pressure and urban greenspace metrics, such as edge density, this important contextual background on census tract demographics should be considered in understanding this relationship and clustering.
Data Description & Citation
The following analysis uses geospatially-enabled census tract data, retrieved above from the 2024 release of the Centers for Disease Control and Prevention (CDC) PLACES dataset.
Data Citation: Centers for Disease Control and Prevention. (2024). PLACES: Census tract data [Data set]. U.S. Department of Health & Human Services. https://data.cdc.gov/download/x7zy-2xmx/application/zip
# Set up a path for the high blood pressure data
# (a descriptive filename is assumed here; the original '.csv' would create a hidden file)
cdc_path = os.path.join(data_dir, 'chicago-hbp.csv')
# Download high blood pressure data (only once) - conditional statement, if it exists do this, if not...
if not os.path.exists(cdc_path):
    cdc_url = (
        # URL for data
        "https://data.cdc.gov/resource/cwsq-ngmh.csv"
        # Parameters to filter on
        "?year=2023"
        "&stateabbr=IL"
        "&countyname=Cook"
        "&measureid=BPHIGH"
        "&$limit=1500"
    )
    cdc_df = (
        # Open as csv
        pd.read_csv(cdc_url)
        # Rename the cols
        .rename(columns={
            'data_value': 'high_blood_pressure',
            'low_confidence_limit': 'hbp_ci_low',
            'high_confidence_limit': 'hbp_ci_high',
            'locationname': 'tract'})
        # Select columns to keep
        [[
            'year', 'tract',
            'high_blood_pressure', 'hbp_ci_low', 'hbp_ci_high',
            'data_value_unit', 'totalpopulation',
            'totalpop18plus'
        ]]
    )
    # Download it to a csv
    cdc_df.to_csv(cdc_path, index=False)
# Load in high blood pressure data
cdc_df = pd.read_csv(cdc_path)
# Preview high blood pressure data
cdc_df
| | year | tract | high_blood_pressure | hbp_ci_low | hbp_ci_high | data_value_unit | totalpopulation | totalpop18plus |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | 17031140800 | 23.0 | 20.9 | 25.1 | % | 6486 | 5247 |
| 1 | 2023 | 17031030500 | 21.7 | 19.6 | 23.7 | % | 6183 | 5381 |
| 2 | 2023 | 17031031502 | 31.9 | 29.4 | 34.3 | % | 4712 | 4125 |
| 3 | 2023 | 17031062700 | 16.9 | 15.2 | 18.5 | % | 2955 | 2425 |
| 4 | 2023 | 17031081800 | 14.8 | 13.2 | 16.5 | % | 11373 | 10668 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1323 | 2023 | 17031833900 | 39.9 | 37.2 | 42.6 | % | 2333 | 1598 |
| 1324 | 2023 | 17031840700 | 21.3 | 19.3 | 23.4 | % | 3900 | 2885 |
| 1325 | 2023 | 17031840800 | 29.0 | 26.5 | 31.5 | % | 3332 | 2380 |
| 1326 | 2023 | 17031832600 | 16.5 | 14.7 | 18.2 | % | 4147 | 3442 |
| 1327 | 2023 | 17031843000 | 44.6 | 41.7 | 47.3 | % | 2880 | 2004 |
1328 rows × 8 columns
Step 3b - Join health data with census tract boundaries¶
# Change tract identifier datatype for merging
chi_tract_gdf.tract2010 = chi_tract_gdf.tract2010.astype('int64')
# Merge census data with geometry
tract_cdc_gdf = (
    chi_tract_gdf
    .merge(
        cdc_df,
        # Specify columns to merge on
        left_on='tract2010',
        right_on='tract',
        # Use inner join
        how='inner'
    )
)
# Show the gdf
tract_cdc_gdf
# Plot high blood pressure data as a choropleth
(
    gv.tile_sources.EsriImagery
    *
    # Map census tracts with high blood pressure data
    gv.Polygons(
        # Reproject to align CRS
        tract_cdc_gdf.to_crs(ccrs.Mercator()),
        # Select columns (column names added for hover)
        vdims=[('high_blood_pressure', 'HBP Rate (%)'), ('tract2010', 'Census Tract')],
        # Ensure CRSs align
        crs=ccrs.Mercator()
    # Add in plot specifications
    ).opts(color='high_blood_pressure', colorbar=True, tools=['hover'])
).opts(
    width=600,
    height=600,
    xaxis=None,
    yaxis=None,
    title="Chicago Census Tracts by 2023 High Blood Pressure Rates (%)",
)
The spatial distribution shows clustering of high blood pressure (HBP) rates: tracts with higher HBP rates are adjacent to one another, and tracts with lower HBP rates are also clustered. We will explore whether we see an association between HBP and landscape metrics such as edge density and mean patch size. Based on visual comparison with the Chicago Racial Dot Map 2020 (2024), the highest rates are strongly visually associated with Black-majority communities and the lowest rates with White-majority census tracts. Redlining practices, systemic racism, and subsequent underinvestment likely contribute to these higher HBP clusters. This could indicate to urban planners and government officials that access to urban greenspace is a public health concern as well as an environmental justice issue.
Data Description & Citation
A spatial join was completed with the Centers for Disease Control and Prevention (CDC) dataset providing high blood pressure rates for Chicago, IL. The data was filtered for the year 2023.
Data Citation: Centers for Disease Control and Prevention. (2024). PLACES: Local data for better health [Data set]. U.S. Department of Health & Human Services. https://data.cdc.gov/resource/cwsq-ngmh.csv
# Number of unique tracts
tract_cdc_gdf['tract2010'].nunique()
788
Step 3c - Get data URLs for urban greenspace imagery¶
NAIP data is freely available through the Microsoft Planetary Computer SpatioTemporal Asset Catalog (STAC) API.
# Connect to the Planetary Computer catalog
e84_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
# Convert geometry to lat/lon for STAC
tract_latlong_gdf = tract_cdc_gdf.to_crs(4326)
# Define a path to save NDVI stats
ndvi_stats_path = os.path.join(data_dir, 'chicago-ndvi-stats.csv')
### Check for existing data - do not access duplicate tracts
# Initialize an empty list for the census tract IDs
downloaded_tracts = []
# Write the conditional
if os.path.exists(ndvi_stats_path):
    # If it exists, open the csv in the path
    ndvi_stats_df = pd.read_csv(ndvi_stats_path)
    # Fill the list with the tract values
    downloaded_tracts = ndvi_stats_df.tract.values
# If it doesn't already exist
else:
    print('No census tracts downloads so far.')
No census tracts downloads so far.
# Run 1 iteration
i = 0
tract_values = tract_latlong_gdf.iloc[i]
tract = tract_values.tract2010
tract
np.int64(17031010100)
# Check if stats are already downloaded for this tract
if not (tract in downloaded_tracts):
    print('proceed with the STAC search')
proceed with the STAC search
# Define naip_search
naip_search = e84_catalog.search(
    # Filter to NAIP data
    collections=['naip'],
    # Grab 2021 NAIP imagery intersecting this tract
    intersects=shapely.to_geojson(tract_values.geometry),
    datetime="2021"
)
# Preview the first two scenes (STAC items)
list(naip_search.items())[:2]
# Build a dataframe
scene_df = pd.DataFrame(dict(
    # Tract column
    tract=tract,
    # Make date column
    date=[pd.to_datetime(scene.datetime).date()
          # Return the scenes (STAC items)
          for scene in naip_search.items()],
    # Make rgbir_href column
    rgbir_href=[scene.assets['image'].href for scene in naip_search.items()],
))
# Display the first scene information
scene_df
| | tract | date | rgbir_href |
|---|---|---|---|
| 0 | 17031010100 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
### Loop through all the tracts
# Convert geometry to lat/lon for STAC
tract_latlong_gdf = tract_cdc_gdf.to_crs(4326)
# Define a path to save NDVI stats
ndvi_stats_path = os.path.join(data_dir, 'chicago-ndvi-stats.csv')
### Check for existing data - do not access duplicate tracts
# Initialize an empty list for the census tract IDs
downloaded_tracts = []
# Write the conditional
if os.path.exists(ndvi_stats_path):
    # If it exists, open the csv in the path
    ndvi_stats_df = pd.read_csv(ndvi_stats_path)
    # Fill the list with the tract values
    downloaded_tracts = ndvi_stats_df.tract.values
# If it doesn't already exist
else:
    print('No census tracts downloads so far.')

### Loop through each census tract
# Make a list to accumulate into
scene_dfs = []
# Loop through each tract value in the gdf
for i, tract_values in tqdm(tract_latlong_gdf.iterrows()):
    # Extract the tract id
    tract = tract_values.tract2010
    # Check if statistics are already downloaded for this tract
    if not (tract in downloaded_tracts):
        # Repeat up to 5 times in case of a momentary disruption
        retry = 0  # Initialize retry counter
        retry_limit = 5  # Max number of tries
        while retry < retry_limit:
            # Try accessing the STAC
            try:
                # Search for tiles
                naip_search = e84_catalog.search(
                    collections=['naip'],
                    intersects=shapely.to_geojson(tract_values.geometry),
                    datetime="2021"
                )
                # Build dataframe with tracts and tile urls
                scene_dfs.append(pd.DataFrame(dict(
                    # Column called tract
                    tract=tract,
                    date=[pd.to_datetime(scene.datetime).date()
                          for scene in naip_search.items()],
                    rgbir_href=[scene.assets['image'].href
                                for scene in naip_search.items()],
                )))
                # Exit retry loop once STAC request succeeds
                break
            # Try again in case of an APIError
            except pystac_client.exceptions.APIError:
                print(
                    f'Could not connect with STAC server. '
                    f'Retrying tract {tract}...')
                # Wait 2 secs before retrying to not upset the API
                time.sleep(2)
                retry += 1
                continue

# Concatenate the url dataframes
if scene_dfs:
    # Concatenate
    scene_df = pd.concat(scene_dfs).reset_index(drop=True)
# Otherwise, there was nothing new to download
else:
    scene_df = None
# Preview the url dataframe
scene_df
No census tracts downloads so far.
0it [00:00, ?it/s]
| tract | date | rgbir_href | |
|---|---|---|---|
| 0 | 17031010100 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 1 | 17031010201 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 2 | 17031010201 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 3 | 17031010202 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 4 | 17031010300 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| ... | ... | ... | ... |
| 1252 | 17031807900 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 1253 | 17031807900 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 1254 | 17031807900 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 1255 | 17031808100 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 1256 | 17031808100 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
1257 rows × 3 columns
# Write out scene_df to csv
scene_df.to_csv(ndvi_stats_path, index=False)
Step 3d- Compute NDVI Statistics¶
Fragmentation statistics:

- Percentage vegetation: the percentage of pixels that exceed a vegetation threshold (0.12 is common with Landsat).
- Mean patch size: the average size of patches, i.e. contiguous areas exceeding the vegetation threshold. Patches can be identified with the label function from scipy.ndimage.
- Edge density: the proportion of edge pixels relative to the total tract area. Edges can be identified by convolving the image with a kernel designed to detect pixels that differ from their surroundings.
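Before applying these statistics to a full tract, they can be illustrated on a tiny synthetic vegetation mask. The array below is made up, but the label and convolve calls mirror the ones used in the analysis that follows.

```python
import numpy as np
from scipy.ndimage import convolve, label

# A made-up binary vegetation mask (1 = NDVI above threshold)
veg_mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
], dtype=int)

# Percentage vegetation: share of pixels above the threshold
frac_veg = veg_mask.sum() / veg_mask.size  # 10 / 36

# Mean patch size: label contiguous vegetated regions, then average
labeled, n_patches = label(veg_mask)             # 2 patches here
patch_sizes = np.bincount(labeled.ravel())[1:]   # drop the background count
mean_patch_size = patch_sizes.mean()             # (9 + 1) / 2 = 5.0

# Edge density: pixels whose 3x3 neighborhood is not uniform
kernel = np.array([[1, 1, 1],
                   [1, -8, 1],
                   [1, 1, 1]])
edges = convolve(veg_mask, kernel, mode='constant')
edge_density = np.sum(edges != 0) / veg_mask.size
```

Note that the kernel response is zero only for vegetated pixels fully surrounded by vegetation (like the center of the 3x3 block) and for non-vegetated pixels with no vegetated neighbors, so everything else counts as edge.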
# Open df of NAIP imagery URLs
scene_df = pd.read_csv(ndvi_stats_path)
# Display all scene urls
scene_df
| tract | date | rgbir_href | |
|---|---|---|---|
| 0 | 17031010100 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 1 | 17031010201 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 2 | 17031010201 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 3 | 17031010202 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 4 | 17031010300 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| ... | ... | ... | ... |
| 1252 | 17031807900 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 1253 | 17031807900 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 1254 | 17031807900 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 1255 | 17031808100 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
| 1256 | 17031808100 | 2021-09-08 | https://naipeuwest.blob.core.windows.net/naip/... |
1257 rows × 3 columns
# Count the unique tracts in scene_df
print("Number of tracts:", len(scene_df.tract.unique()))
Number of tracts: 788
# Pull out the first tract value
tract = chi_tract_gdf.loc[0].tract2010
tract
# Convert tract to string
tract = str(tract)
tract
'17031010100'
# Extract the image URL for this tract
href_value = scene_df.loc[
    scene_df.tract.astype(str) == tract,
    # Select column to extract
    "rgbir_href"
].iloc[0]
# Display url
href_value
'https://naipeuwest.blob.core.windows.net/naip/v002/il/2021/il_060cm_2021/42087/m_4208759_se_16_060_20210908.tif'
### Process one scene
# Open the raster
tile_da = rxr.open_rasterio(
    # Give it the url selected
    href_value,
    # Mask out the NoData pixels
    masked=True
# Remove the dimensions of length 1
).squeeze()
# Check it out
tile_da
<xarray.DataArray (band: 4, y: 12724, x: 9643)> Size: 2GB
[490790128 values with dtype=float32]
Coordinates:
* band (band) int64 32B 1 2 3 4
* y (y) float64 102kB 4.657e+06 4.657e+06 ... 4.65e+06 4.65e+06
* x (x) float64 77kB 4.428e+05 4.428e+05 ... 4.486e+05 4.486e+05
spatial_ref int64 8B 0
Attributes:
TIFFTAG_DOCUMENTNAME: Evanston SE 4208759
TIFFTAG_IMAGEDESCRIPTION: Image courtesy of USDA Farm Service Agency's N...
TIFFTAG_DATETIME: 2021:10:21 20:26:02
TIFFTAG_ARTIST: Surdex Corporation, 636-368-4400, www.surdex.com
TIFFTAG_HOSTCOMPUTER: AMDNODE7
AREA_OR_POINT: Area
scale_factor: 1.0
    add_offset:               0.0
# Check the dimensions
tile_da.dims
('band', 'y', 'x')
# Print the column names
print(scene_df.columns)
Index(['tract', 'date', 'rgbir_href'], dtype='object')
# Open high blood pressure data spatial aligned with census tracts
tract_cdc_gdf
| place2010 | tract2010 | ST | PlaceName | plctract10 | PlcTrPop10 | geometry | year | tract | high_blood_pressure | hbp_ci_low | hbp_ci_high | data_value_unit | totalpopulation | totalpop18plus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1714000 | 17031010100 | 17 | Chicago | 1714000-17031010100 | 4854 | POLYGON ((-9758835.381 5164429.383, -9758837.3... | 2023 | 17031010100 | 32.6 | 30.1 | 35.2 | % | 4905 | 4083 |
| 1 | 1714000 | 17031010201 | 17 | Chicago | 1714000-17031010201 | 6450 | POLYGON ((-9760143.496 5163888.741, -9760143.4... | 2023 | 17031010201 | 33.1 | 30.5 | 35.7 | % | 6939 | 5333 |
| 2 | 1714000 | 17031010202 | 17 | Chicago | 1714000-17031010202 | 2818 | POLYGON ((-9759754.212 5163883.196, -9759726.6... | 2023 | 17031010202 | 32.6 | 29.9 | 35.2 | % | 2742 | 2296 |
| 3 | 1714000 | 17031010300 | 17 | Chicago | 1714000-17031010300 | 6236 | POLYGON ((-9758695.229 5163870.91, -9758695.78... | 2023 | 17031010300 | 32.2 | 29.6 | 34.7 | % | 6305 | 5606 |
| 4 | 1714000 | 17031010400 | 17 | Chicago | 1714000-17031010400 | 5042 | POLYGON ((-9757724.634 5160715.939, -9757742.2... | 2023 | 17031010400 | 22.5 | 20.3 | 24.5 | % | 5079 | 4742 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 783 | 1714000 | 17031770700 | 17 | Chicago | 1714000-17031770700 | 0 | POLYGON ((-9780753.304 5157066.079, -9780752.0... | 2023 | 17031770700 | 31.3 | 28.7 | 34.0 | % | 2537 | 2093 |
| 784 | 1714000 | 17031770800 | 17 | Chicago | 1714000-17031770800 | 0 | POLYGON ((-9783235.84 5154620.343, -9783211.23... | 2023 | 17031770800 | 28.4 | 25.8 | 30.9 | % | 5661 | 4457 |
| 785 | 1714000 | 17031805600 | 17 | Chicago | 1714000-17031805600 | 0 | POLYGON ((-9776210.02 5161605.738, -9776213.47... | 2023 | 17031805600 | 25.9 | 23.3 | 28.3 | % | 4710 | 3420 |
| 786 | 1714000 | 17031807900 | 17 | Chicago | 1714000-17031807900 | 0 | POLYGON ((-9768609.902 5160576.634, -9768654.5... | 2023 | 17031807900 | 31.6 | 28.9 | 34.2 | % | 4201 | 3396 |
| 787 | 1714000 | 17031808100 | 17 | Chicago | 1714000-17031808100 | 0 | MULTIPOLYGON (((-9774480.671 5161127.722, -977... | 2023 | 17031808100 | 42.2 | 38.8 | 45.5 | % | 4010 | 3648 |
788 rows × 15 columns
# Check data type
tract_cdc_gdf["tract2010"].head()
0 17031010100 1 17031010201 2 17031010202 3 17031010300 4 17031010400 Name: tract2010, dtype: int64
# Another way to check data type
tract_cdc_gdf["tract2010"].dtype
dtype('int64')
# Check that this tract ID appears among the (stringified) tract values
tract in tract_cdc_gdf["tract2010"].astype(str).values
True
# Get boundary of the tract we're working with
boundary = (
    tract_cdc_gdf
    # Deal with issues with integers & ensure all tracts are strings
    .assign(tract2010=lambda df: df["tract2010"].astype(str))  # Assign modifies columns
    # Use tract ID as the index
    .set_index("tract2010")
    # Grab the tract we're using
    .loc[[tract]]
    # Make the CRS match
    .to_crs(tile_da.rio.crs)
    # Grab the geometry
    .geometry
)
# Display boundary information
boundary
tract2010 17031010100 POLYGON ((444936.583 4652546.967, 444935.043 4... Name: geometry, dtype: geometry
# Crop to bounding box
crop_da = tile_da.rio.clip_box(
    # Get coordinates of the bounding box
    *boundary.envelope.total_bounds,
    # Let it expand a little
    auto_expand=True
)
# Display the bounding box info
crop_da
<xarray.DataArray (band: 4, y: 724, x: 1862)> Size: 22MB
[5392352 values with dtype=float32]
Coordinates:
* band (band) int64 32B 1 2 3 4
* y (y) float64 6kB 4.653e+06 4.653e+06 ... 4.652e+06 4.652e+06
* x (x) float64 15kB 4.439e+05 4.439e+05 ... 4.451e+05 4.451e+05
spatial_ref int64 8B 0
Attributes:
TIFFTAG_DOCUMENTNAME: Evanston SE 4208759
TIFFTAG_IMAGEDESCRIPTION: Image courtesy of USDA Farm Service Agency's N...
TIFFTAG_DATETIME: 2021:10:21 20:26:02
TIFFTAG_ARTIST: Surdex Corporation, 636-368-4400, www.surdex.com
TIFFTAG_HOSTCOMPUTER: AMDNODE7
AREA_OR_POINT: Area
scale_factor: 1.0
    add_offset:               0.0
# Clip to the exact geometry of the census tract
clip_da = crop_da.rio.clip(
    boundary,
    # Include all pixels touching the boundary
    all_touched=True
)
# Display the exact geometry info
clip_da
<xarray.DataArray (band: 4, y: 724, x: 1862)> Size: 22MB
array([[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]],
shape=(4, 724, 1862), dtype=float32)
Coordinates:
* band (band) int64 32B 1 2 3 4
* y (y) float64 6kB 4.653e+06 4.653e+06 ... 4.652e+06 4.652e+06
* x (x) float64 15kB 4.439e+05 4.439e+05 ... 4.451e+05 4.451e+05
spatial_ref int64 8B 0
Attributes:
TIFFTAG_DOCUMENTNAME: Evanston SE 4208759
TIFFTAG_IMAGEDESCRIPTION: Image courtesy of USDA Farm Service Agency's N...
TIFFTAG_DATETIME: 2021:10:21 20:26:02
TIFFTAG_ARTIST: Surdex Corporation, 636-368-4400, www.surdex.com
TIFFTAG_HOSTCOMPUTER: AMDNODE7
AREA_OR_POINT: Area
scale_factor: 1.0
    add_offset:               0.0
# Look at the shape of the clipped file
clip_da.shape
(4, 724, 1862)
# Calculate NDVI: (NIR - Red) / (NIR + Red)
ndvi_da = (
    (clip_da.sel(band=4) - clip_da.sel(band=1))
    / (clip_da.sel(band=4) + clip_da.sel(band=1))
)
# Look at the NDVI results
ndvi_da
<xarray.DataArray (y: 724, x: 1862)> Size: 5MB
array([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]],
shape=(724, 1862), dtype=float32)
Coordinates:
* y (y) float64 6kB 4.653e+06 4.653e+06 ... 4.652e+06 4.652e+06
* x (x) float64 15kB 4.439e+05 4.439e+05 ... 4.451e+05 4.451e+05
    spatial_ref  int64 8B 0

### Skip this step if data are already downloaded
if scene_df is not None:
    ### Get an example
    # Pull out the identifier for an example tract
    tract = chi_tract_gdf.loc[0].tract2010
    # Subset scene_df to only include that tract
    ex_scene_gdf = scene_df[scene_df.tract == tract]
    # Make an empty list to accumulate into
    tile_das = []
    # Loop through all the images for this tract
    for _, href_s in ex_scene_gdf.iterrows():
        # Open vsi connection to data
        tile_da = rxr.open_rasterio(
            # Point to the location of the raster
            href_s.rgbir_href,
            # Deal with no data, remove extra dimension
            masked=True).squeeze()
        ### Crop data to the bounding box of the census tract
        # Make the bounding box
        boundary = (
            tract_cdc_gdf
            # Use the tract ID as index
            .set_index('tract2010')
            # Select the geometry for the tract
            .loc[[tract]]
            # Reproject to the raster's CRS
            .to_crs(tile_da.rio.crs)
            # Extract the geometry
            .geometry
        )
        # Crop to the bounding box
        crop_da = tile_da.rio.clip_box(
            # Get the coordinates of the bounding box
            *boundary.envelope.total_bounds,
            # Expand it a little
            auto_expand=True
        )
        # Clip data to the boundary of the census tract
        clip_da = crop_da.rio.clip(
            boundary, all_touched=True)
        # Compute NDVI
        ndvi_da = (
            (clip_da.sel(band=4) - clip_da.sel(band=1))
            / (clip_da.sel(band=4) + clip_da.sel(band=1))
        )
        # Accumulate result
        tile_das.append(ndvi_da)
    # Check
    print(f"Number of tiles in tiles_da: {len(tile_das)}")
    # Merge data
    scene_da = rxrmerge.merge_arrays(tile_das)
scene_da
# Mask vegetation
veg_mask = (scene_da > .3)
# Calculate mean patch size
labeled_patches, num_patches = label(veg_mask)
patch_sizes = np.bincount(labeled_patches.ravel())[1:]
mean_patch_size = patch_sizes.mean()
# Check the calculations
patch_sizes
Number of tiles in tiles_da: 1
array([1716, 1, 288, ..., 1, 1, 1], shape=(3426,))
# Check mean patch size result
mean_patch_size
np.float64(55.225919439579684)
# Make kernel to calculate edge density
kernel = np.array([
[1, 1, 1],
[1, -8, 1],
[1, 1, 1]])
# Calculate edge density (nonzero convolution response = edge pixel)
edges = convolve(veg_mask, kernel, mode='constant')
edge_density = np.sum(edges != 0) / veg_mask.size
# Display the edge density
edge_density
np.float64(0.11861243479654147)
Repeat for all tracts¶
# Make csv for ndvi stats
ndvi_stats_path_veg = os.path.join(data_dir, 'chicago-ndvi-stats-veg.csv')
# Skip this step if data are already downloaded
if scene_df is not None:
    # Loop through the census tracts with URLs
    for tract, tract_date_gdf in tqdm(scene_df.groupby('tract')):
        ### Open all images for tract
        # Create a list to store NDVI arrays, with 1 array per NAIP image
        tile_das = []
        # Iterate over rows in tract-specific dataframe
        for _, href_s in tract_date_gdf.iterrows():
            # Open vsi connection to data
            tile_da = rxr.open_rasterio(
                # Mask and squeeze
                href_s.rgbir_href, masked=True).squeeze()
            ### Clip data
            # Get tract boundary
            boundary = (
                tract_cdc_gdf
                .set_index('tract2010')   # Use tract ID as index
                .loc[[tract]]             # Select the tract
                .to_crs(tile_da.rio.crs)  # Ensure CRSs match
                .geometry                 # Extract geometry
            )
            # Crop to bounding box
            crop_da = tile_da.rio.clip_box(
                *boundary.envelope.total_bounds,
                auto_expand=True
            )
            # Clip to actual tract geometry
            clip_da = crop_da.rio.clip(boundary, all_touched=True)
            # Compute NDVI
            ndvi_da = ((clip_da.sel(band=4) - clip_da.sel(band=1))
                       / (clip_da.sel(band=4) + clip_da.sel(band=1)))
            # Accumulate result
            tile_das.append(ndvi_da)
        # Merge data
        scene_da = rxrmerge.merge_arrays(tile_das)
        # Mask vegetation
        veg_mask = (scene_da > .3)
        ### Calculate statistics and save data to file
        # Count all valid pixels in the tract
        total_pixels = scene_da.notnull().sum()
        # Count all vegetated pixels
        veg_pixels = veg_mask.sum()
        # Identify vegetation patches
        labeled_patches, num_patches = label(veg_mask)
        # Count pixels in each patch (drop the background count)
        patch_sizes = np.bincount(labeled_patches.ravel())[1:]
        # Mean patch size (guard against tracts with no patches, which
        # would otherwise raise the empty-slice warning seen below)
        mean_patch_size = patch_sizes.mean() if patch_sizes.size > 0 else 0
        ### Calculate edge density
        # Define the kernel
        kernel = np.array([
            [1, 1, 1],
            [1, -8, 1],
            [1, 1, 1]])
        # Detect boundaries between veg and non-veg
        edges = convolve(veg_mask, kernel, mode='constant')
        # Calculate proportion of edge pixels by tract area (normalization)
        edge_density = np.sum(edges != 0) / veg_mask.size
        ### Add a row to the statistics file for this tract
        # Create a new dataframe for the tract and append it to the csv
        pd.DataFrame(dict(
            # Store tract ID
            tract=[tract],
            # Store total pixels
            total_pixels=[int(total_pixels)],
            # Fraction of vegetated pixels
            frac_veg=[float(veg_pixels / total_pixels)],
            # Store mean patch size
            mean_patch_size=[mean_patch_size],
            # Edge density
            edge_density=[edge_density]
        # Write out as a csv
        )).to_csv(
            ndvi_stats_path_veg,
            # Append mode
            mode='a',
            # Get rid of row numbers
            index=False,
            # Write column names if the csv doesn't exist yet
            header=(not os.path.exists(ndvi_stats_path_veg))
        )
### Re-load results from file
0%| | 0/788 [00:00<?, ?it/s]
C:\Users\nymve\AppData\Local\Temp\ipykernel_17296\2508500637.py:66: RuntimeWarning: Mean of empty slice.
  mean_patch_size = patch_sizes.mean()
c:\Users\nymve\miniconda3\envs\earth-analytics-python\Lib\site-packages\numpy\_core\_methods.py:144: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
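The RuntimeWarning above is triggered by tracts with no vegetated pixels above the NDVI threshold: `patch_sizes` is then empty and `.mean()` returns NaN. One way to handle this explicitly (a sketch with a hypothetical helper, not part of the pipeline above):

```python
import numpy as np

def safe_mean_patch_size(patch_sizes):
    """Return the mean patch size, or NaN explicitly when there are no patches."""
    # An empty array would trigger numpy's 'Mean of empty slice' warning
    if patch_sizes.size == 0:
        return float('nan')
    return float(patch_sizes.mean())

print(safe_mean_patch_size(np.array([], dtype=int)))  # nan, without a warning
print(safe_mean_patch_size(np.array([3, 2])))         # 2.5
```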
# Check the file
ndvi_stats_df = pd.read_csv(ndvi_stats_path_veg)
ndvi_stats_df
| tract | total_pixels | frac_veg | mean_patch_size | edge_density | |
|---|---|---|---|---|---|
| 0 | 17031010100 | 1059033 | 0.178657 | 55.225919 | 0.118612 |
| 1 | 17031010201 | 1531554 | 0.213859 | 57.543394 | 0.161668 |
| 2 | 17031010202 | 978546 | 0.186055 | 63.260250 | 0.123673 |
| 3 | 17031010300 | 1308392 | 0.191675 | 57.113642 | 0.126384 |
| 4 | 17031010400 | 1516964 | 0.198563 | 52.983817 | 0.079474 |
| ... | ... | ... | ... | ... | ... |
| 783 | 17031843500 | 5647650 | 0.075254 | 9.732779 | 0.104686 |
| 784 | 17031843600 | 1142916 | 0.054393 | 9.177296 | 0.101217 |
| 785 | 17031843700 | 6025768 | 0.027644 | 7.477533 | 0.047642 |
| 786 | 17031843800 | 3639014 | 0.093920 | 24.169295 | 0.105052 |
| 787 | 17031843900 | 4521964 | 0.199113 | 23.985215 | 0.124282 |
788 rows × 5 columns
# Merge census data with geometry
chi_ndvi_cdc_gdf = (
    # Grab the census tract gdf
    tract_cdc_gdf
    # Merge with the NDVI stats df
    .merge(
        ndvi_stats_df,
        # Match on the tract ID: tract2010 (left) to tract (right)
        left_on='tract2010',
        right_on='tract',
        how='inner'
    )
)
# Check it out
chi_ndvi_cdc_gdf
| place2010 | tract2010 | ST | PlaceName | plctract10 | PlcTrPop10 | geometry | year | tract_x | high_blood_pressure | hbp_ci_low | hbp_ci_high | data_value_unit | totalpopulation | totalpop18plus | tract_y | total_pixels | frac_veg | mean_patch_size | edge_density | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1714000 | 17031010100 | 17 | Chicago | 1714000-17031010100 | 4854 | POLYGON ((-9758835.381 5164429.383, -9758837.3... | 2023 | 17031010100 | 32.6 | 30.1 | 35.2 | % | 4905 | 4083 | 17031010100 | 1059033 | 0.178657 | 55.225919 | 0.118612 |
| 1 | 1714000 | 17031010201 | 17 | Chicago | 1714000-17031010201 | 6450 | POLYGON ((-9760143.496 5163888.741, -9760143.4... | 2023 | 17031010201 | 33.1 | 30.5 | 35.7 | % | 6939 | 5333 | 17031010201 | 1531554 | 0.213859 | 57.543394 | 0.161668 |
| 2 | 1714000 | 17031010202 | 17 | Chicago | 1714000-17031010202 | 2818 | POLYGON ((-9759754.212 5163883.196, -9759726.6... | 2023 | 17031010202 | 32.6 | 29.9 | 35.2 | % | 2742 | 2296 | 17031010202 | 978546 | 0.186055 | 63.260250 | 0.123673 |
| 3 | 1714000 | 17031010300 | 17 | Chicago | 1714000-17031010300 | 6236 | POLYGON ((-9758695.229 5163870.91, -9758695.78... | 2023 | 17031010300 | 32.2 | 29.6 | 34.7 | % | 6305 | 5606 | 17031010300 | 1308392 | 0.191675 | 57.113642 | 0.126384 |
| 4 | 1714000 | 17031010400 | 17 | Chicago | 1714000-17031010400 | 5042 | POLYGON ((-9757724.634 5160715.939, -9757742.2... | 2023 | 17031010400 | 22.5 | 20.3 | 24.5 | % | 5079 | 4742 | 17031010400 | 1516964 | 0.198563 | 52.983817 | 0.079474 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 783 | 1714000 | 17031770700 | 17 | Chicago | 1714000-17031770700 | 0 | POLYGON ((-9780753.304 5157066.079, -9780752.0... | 2023 | 17031770700 | 31.3 | 28.7 | 34.0 | % | 2537 | 2093 | 17031770700 | 5547730 | 0.022761 | 7.584359 | 0.020706 |
| 784 | 1714000 | 17031770800 | 17 | Chicago | 1714000-17031770800 | 0 | POLYGON ((-9783235.84 5154620.343, -9783211.23... | 2023 | 17031770800 | 28.4 | 25.8 | 30.9 | % | 5661 | 4457 | 17031770800 | 174812 | 0.138652 | 6.197392 | 0.172915 |
| 785 | 1714000 | 17031805600 | 17 | Chicago | 1714000-17031805600 | 0 | POLYGON ((-9776210.02 5161605.738, -9776213.47... | 2023 | 17031805600 | 25.9 | 23.3 | 28.3 | % | 4710 | 3420 | 17031805600 | 463 | 0.000000 | NaN | 0.000000 |
| 786 | 1714000 | 17031807900 | 17 | Chicago | 1714000-17031807900 | 0 | POLYGON ((-9768609.902 5160576.634, -9768654.5... | 2023 | 17031807900 | 31.6 | 28.9 | 34.2 | % | 4201 | 3396 | 17031807900 | 26883 | 0.079121 | 23.898876 | 0.101617 |
| 787 | 1714000 | 17031808100 | 17 | Chicago | 1714000-17031808100 | 0 | MULTIPOLYGON (((-9774480.671 5161127.722, -977... | 2023 | 17031808100 | 42.2 | 38.8 | 45.5 | % | 4010 | 3648 | 17031808100 | 44540 | 0.303839 | 19.785088 | 0.154949 |
788 rows × 20 columns
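Note the `tract_x` / `tract_y` columns in the merged output: both input frames evidently carry a `tract` column, and because the join uses `left_on`/`right_on` rather than a shared key, pandas keeps both and disambiguates them with its default suffixes. A minimal sketch with toy frames (hypothetical values):

```python
import pandas as pd

# Toy frames mimicking the structure above: both carry a 'tract' column
left = pd.DataFrame({'tract2010': [1, 2], 'tract': [1, 2], 'pop': [10, 20]})
right = pd.DataFrame({'tract': [1, 2], 'frac_veg': [0.1, 0.2]})

# Overlapping column names get _x / _y suffixes in the result
merged = left.merge(right, left_on='tract2010', right_on='tract', how='inner')
print(merged.columns.tolist())
```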
# Additional check: drop any duplicate tract rows, keeping the last occurrence
clean_df = chi_ndvi_cdc_gdf.drop_duplicates(
    subset="tract2010",
    keep="last"
)
# Additional testing
print("Original rows:", len(chi_ndvi_cdc_gdf))
print("After cleaning:", len(clean_df))
print("Unique tracts:", clean_df["tract2010"].nunique())
Original rows: 788 After cleaning: 788 Unique tracts: 788
### Plot choropleths with vegetation statistics
# Make a choropleth plotting function
def plot_chloropleth(gdf, **opts):
    # Docstring to describe the function
    """Generate a choropleth with the given color column"""
    # Use geoviews to make a polygon object
    return gv.Polygons(
        gdf.to_crs(ccrs.Mercator()),
        crs=ccrs.Mercator()
    # Drop the x and y axes and add a colorbar
    ).opts(xaxis=None, yaxis=None, colorbar=True, **opts)
# Apply the function to the Chicago data
(
    plot_chloropleth(
        # Plot high blood pressure by census tract
        chi_ndvi_cdc_gdf,
        color='high_blood_pressure',
        cmap='viridis',
        title='High Blood Pressure Rates in Chicago')
    +
    # Add a second plot with vegetation edge density
    plot_chloropleth(
        chi_ndvi_cdc_gdf,
        color='edge_density',
        cmap='Greens',
        title='Edge Density in Chicago')
)
Case Study: Chicago Edge Density vs High Blood Pressure Rates
The comparison of census tracts shows some overlap between tracts with higher HBP rates and tracts with higher edge density (a measure of vegetation and habitat fragmentation). Edge density quantifies how many edges exist between different landscape feature types within a boundary area, in this case the census tract; higher edge density indicates more fragmented vegetation or habitat. Edge density was normalized by dividing the count of edge pixels by the total number of pixels in the tract, so that tracts of different sizes could be compared. If tracts with higher edge density tend to show higher HBP rates, there could be a link between urban greenspace fragmentation and human health outcomes such as HBP. High levels of habitat fragmentation have many negative consequences not only for human health but also for ecological health, including changes in microclimate, reduced wildlife mobility, and further ecological fragmentation.
It is possible that these tracts are historically minority areas, which have experienced less public funding over time and perhaps increased industrial development. This may have led to less habitat conservation and a more fragmented environment, which would produce higher edge density and a lack of quality greenspace within the census tract. If there is a connection between edge density and public health, then urban, environmental, and land-use planning become even more important in heavily populated areas.
Ordinary Least Squares (OLS) Model: Chicago High Blood Pressure Rates & Urban Greenspace¶
An Ordinary Least Squares (OLS) regression rests on several assumptions. We assume that the relationship between predictors such as edge density and patch size and the outcome, high blood pressure prevalence, is linear; that the errors (residuals = observed y - predicted y) are normally distributed; and that the x variables are not highly collinear. Each census tract is assumed to be spatially independent, and the relationship between x and y is assumed to be stationary, meaning it remains the same throughout Chicago. Log transformations were used to reduce the skew of the data, make inference easier, and support the assumption of normally distributed errors. We removed missing data values to meet the requirement of complete observations.
The objective of this model is to determine whether x variables, such as edge density, are correlated with high blood pressure. We can evaluate this with R² and p-values, use Q-Q plots to check that the errors are normally distributed, and map the error distribution spatially to look for clustering.
A potential problem with the OLS model is that the tracts may not be spatially independent, that is, they may be spatially correlated. This is something to consider for this and future analyses. The relationship between x and y may also not be stationary throughout Chicago: the city is not uniform, and additional factors may operate in different census tracts. Finally, some variables may be collinear with one another, which could produce spurious correlations if not accounted for.
Step 5a - Data preparation¶
# Check missing values per column
chi_ndvi_cdc_gdf.isna().any()
place2010              False
tract2010              False
ST                     False
PlaceName              False
plctract10             False
PlcTrPop10             False
geometry               False
year                   False
tract_x                False
high_blood_pressure    False
hbp_ci_low             False
hbp_ci_high            False
data_value_unit        False
totalpopulation        False
totalpop18plus         False
tract_y                False
total_pixels           False
frac_veg               False
mean_patch_size         True
edge_density           False
dtype: bool
# Check missing values per row
chi_ndvi_cdc_gdf.isna().any(axis = 1)
0 False
1 False
2 False
3 False
4 False
...
783 False
784 False
785 True
786 False
787 False
Length: 788, dtype: bool
# Check for missing data
chi_ndvi_cdc_gdf.isna().any().any()
np.True_
# Check sum of the data
chi_ndvi_cdc_gdf.isna().sum()
place2010              0
tract2010              0
ST                     0
PlaceName              0
plctract10             0
PlcTrPop10             0
geometry               0
year                   0
tract_x                0
high_blood_pressure    0
hbp_ci_low             0
hbp_ci_high            0
data_value_unit        0
totalpopulation        0
totalpop18plus         0
tract_y                0
total_pixels           0
frac_veg               0
mean_patch_size        1
edge_density           0
dtype: int64
# Select variables for the model and visualize their distributions
variables = ['frac_veg', 'high_blood_pressure', 'mean_patch_size', 'edge_density']
# Create a scatter matrix
hvplot.scatter_matrix(
    chi_ndvi_cdc_gdf[variables]
)
### Histograms
### Make a 2x2 facet of subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
### Variables and titles to plot
variables = ['frac_veg', 'high_blood_pressure', 'mean_patch_size', 'edge_density']
titles = ['Vegetated fraction', 'Adult HBP rate', 'Mean patch size', 'Edge density']
### Loop through variables and axes to plot histograms
for i, (var, title) in enumerate(zip(variables, titles)):
    ax = axes[i // 2, i % 2]
    ax.hist(chi_ndvi_cdc_gdf[var], bins=10, color='gray', edgecolor='black')
    ax.set_title(title)
    ax.set_xlabel(var)
    ax.set_ylabel('Frequency')
### Adjust layout to prevent overlap
plt.tight_layout()
plt.show()
# Drop missing obs
model_df = (
    chi_ndvi_cdc_gdf
    # Make a copy
    .copy()
    # Subset the columns
    [['frac_veg', 'high_blood_pressure', 'mean_patch_size', 'edge_density', 'geometry']]
    # Drop rows (observations) with missing values
    .dropna()
)
# Display the model
model_df
| frac_veg | high_blood_pressure | mean_patch_size | edge_density | geometry | |
|---|---|---|---|---|---|
| 0 | 0.178657 | 32.6 | 55.225919 | 0.118612 | POLYGON ((-9758835.381 5164429.383, -9758837.3... |
| 1 | 0.213859 | 33.1 | 57.543394 | 0.161668 | POLYGON ((-9760143.496 5163888.741, -9760143.4... |
| 2 | 0.186055 | 32.6 | 63.260250 | 0.123673 | POLYGON ((-9759754.212 5163883.196, -9759726.6... |
| 3 | 0.191675 | 32.2 | 57.113642 | 0.126384 | POLYGON ((-9758695.229 5163870.91, -9758695.78... |
| 4 | 0.198563 | 22.5 | 52.983817 | 0.079474 | POLYGON ((-9757724.634 5160715.939, -9757742.2... |
| ... | ... | ... | ... | ... | ... |
| 782 | 0.020053 | 27.9 | 13.674242 | 0.029123 | MULTIPOLYGON (((-9787333.178 5161561.245, -978... |
| 783 | 0.022761 | 31.3 | 7.584359 | 0.020706 | POLYGON ((-9780753.304 5157066.079, -9780752.0... |
| 784 | 0.138652 | 28.4 | 6.197392 | 0.172915 | POLYGON ((-9783235.84 5154620.343, -9783211.23... |
| 786 | 0.079121 | 31.6 | 23.898876 | 0.101617 | POLYGON ((-9768609.902 5160576.634, -9768654.5... |
| 787 | 0.303839 | 42.2 | 19.785088 | 0.154949 | MULTIPOLYGON (((-9774480.671 5161127.722, -977... |
787 rows × 5 columns
### Perform variable transformation
# Log of high_blood_pressure rate
model_df['log_hbp'] = np.log(model_df.high_blood_pressure)
# Log of patch size
model_df['log_patch'] = np.log(model_df.mean_patch_size)
# Show df after transformation
model_df
| frac_veg | high_blood_pressure | mean_patch_size | edge_density | geometry | log_hbp | log_patch | |
|---|---|---|---|---|---|---|---|
| 0 | 0.178657 | 32.6 | 55.225919 | 0.118612 | POLYGON ((-9758835.381 5164429.383, -9758837.3... | 3.484312 | 4.011432 |
| 1 | 0.213859 | 33.1 | 57.543394 | 0.161668 | POLYGON ((-9760143.496 5163888.741, -9760143.4... | 3.499533 | 4.052539 |
| 2 | 0.186055 | 32.6 | 63.260250 | 0.123673 | POLYGON ((-9759754.212 5163883.196, -9759726.6... | 3.484312 | 4.147257 |
| 3 | 0.191675 | 32.2 | 57.113642 | 0.126384 | POLYGON ((-9758695.229 5163870.91, -9758695.78... | 3.471966 | 4.045043 |
| 4 | 0.198563 | 22.5 | 52.983817 | 0.079474 | POLYGON ((-9757724.634 5160715.939, -9757742.2... | 3.113515 | 3.969987 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 782 | 0.020053 | 27.9 | 13.674242 | 0.029123 | MULTIPOLYGON (((-9787333.178 5161561.245, -978... | 3.328627 | 2.615514 |
| 783 | 0.022761 | 31.3 | 7.584359 | 0.020706 | POLYGON ((-9780753.304 5157066.079, -9780752.0... | 3.443618 | 2.026088 |
| 784 | 0.138652 | 28.4 | 6.197392 | 0.172915 | POLYGON ((-9783235.84 5154620.343, -9783211.23... | 3.346389 | 1.824129 |
| 786 | 0.079121 | 31.6 | 23.898876 | 0.101617 | POLYGON ((-9768609.902 5160576.634, -9768654.5... | 3.453157 | 3.173831 |
| 787 | 0.303839 | 42.2 | 19.785088 | 0.154949 | MULTIPOLYGON (((-9774480.671 5161127.722, -977... | 3.742420 | 2.984929 |
787 rows × 7 columns
# Visualize transformed variables
hvplot.scatter_matrix(
    model_df[['frac_veg', 'edge_density', 'log_patch', 'log_hbp']]
)
### Q-Q plots
# Set variables
var_qq = ['frac_veg', 'edge_density', 'log_patch', 'log_hbp']
plt.figure(figsize=(12, 10))
for i, var in enumerate(var_qq, 1):
    # Make a 2x2 facet
    plt.subplot(2, 2, i)
    # Compare the sample quantiles against a normal distribution
    stats.probplot(model_df[var], dist="norm", plot=plt)
    # Add title
    plt.title(f'Q-Q plot for {var}')
# Plot it
plt.tight_layout()
plt.show()
Transformations & Plot Selections¶
First, we performed a log transformation to reduce the skew of the data and to make better inferences about correlations. We chose Q-Q plots to check whether our data fit the OLS assumption that the errors are normally distributed. Mean patch size fits well, and edge density is only slightly curved or skewed. High blood pressure shows a very small S-curve, which is likely normal noise in the data. There are strong correlations between edge_density and frac_veg, suggesting possible multicollinearity, which is something to consider when using the OLS model. Overall, the Q-Q plots suggest that most of the required OLS assumptions are met.
Step 5b - Fit and Predict¶
# Select predictor and outcome variables
X = model_df[['edge_density', 'log_patch']]
y = model_df[['log_hbp']]
# Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=19
)
# Fit a linear regression
reg = LinearRegression()
# Fit our data
reg.fit(X_train, y_train)
# Predict high_blood_pressure values for the test dataset
y_test['pred_hbp'] = np.exp(reg.predict(X_test))
# Real high_blood_pressure rate for comparison
y_test['high_blood_pressure'] = np.exp(y_test.log_hbp)
y_test
| log_hbp | pred_hbp | high_blood_pressure | |
|---|---|---|---|
| 295 | 2.995732 | 23.869860 | 20.0 |
| 549 | 3.666122 | 36.667828 | 39.1 |
| 595 | 3.858622 | 36.374828 | 47.4 |
| 378 | 2.954910 | 28.591931 | 19.2 |
| 16 | 3.538057 | 33.421144 | 34.4 |
| ... | ... | ... | ... |
| 548 | 3.475067 | 31.311823 | 32.3 |
| 260 | 2.833213 | 28.360528 | 17.0 |
| 504 | 3.751854 | 29.439412 | 42.6 |
| 529 | 3.269569 | 31.136613 | 26.3 |
| 59 | 2.944439 | 28.355779 | 19.0 |
260 rows × 3 columns
# Plot measured vs. predicted high blood pressure prevalence with a 1-to-1 line
y_max = y_test.high_blood_pressure.max()
(
    y_test
    .hvplot.scatter(x='high_blood_pressure',
                    y='pred_hbp',
                    xlabel="Measured adult high blood pressure prevalence",
                    ylabel="Predicted adult high blood pressure prevalence",
                    title="Linear regression performance: testing data")
    .opts(aspect='equal',
          xlim=(0, y_max), ylim=(0, y_max),
          width=600, height=600)
) * hv.Slope(slope=1, y_intercept=0).opts(color='black')
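Because the model is fit on `log_hbp`, predictions must be exponentiated to return to percentage units, as done above. A synthetic sketch of this fit-on-log, exponentiate-back pattern (toy data, not the tract data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy predictors standing in for edge_density and log_patch
rng = np.random.default_rng(19)
X = rng.uniform(0, 1, size=(100, 2))

# Outcome that is exactly linear on the log scale (no noise)
y = np.exp(1.0 + 0.5 * X[:, 0] + 0.2 * X[:, 1])

reg = LinearRegression().fit(X, np.log(y))  # fit in log space
pred = np.exp(reg.predict(X))               # back-transform to original units

print(np.allclose(pred, y))  # exact recovery here because there is no noise
```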
Step 5c - Explore spatial bias¶
### Compute model error for all census tracts
# Use the trained model to predict high_blood_pressure prevalence for each census tract
model_df['pred_hbp'] = np.exp(reg.predict(X))
# Calculate model error
model_df['error_hbp'] = model_df['pred_hbp'] - model_df['high_blood_pressure']
# Plot error geographically as a choropleth
(
    plot_chloropleth(model_df,
                     color='error_hbp',
                     cmap='RdBu',
                     tools=['hover'])  # Add hover tool with information
    # Set the frame width and aspect ratio
    .opts(
        frame_width=600,
        aspect='equal',
        title='Prediction Error in High Blood Pressure Rates: Chicago Analysis',
        clabel="High Blood Pressure Prediction Error")
)
The errors, whether negative or positive (that is, whether the model over- or under-predicted), are clustered together. This might mean that the census tracts are not spatially independent of each other, a requirement of the OLS model. It could also mean that something else in those census tracts is affecting high blood pressure, such as pollution, air quality, or socioeconomic status. The negative errors appear visually correlated with Black-majority census tracts, and the positive errors with Hispanic- or White-majority areas. For future analysis, a Moran's I test could be run to quantify the degree of spatial autocorrelation.
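As a sketch of what a Moran's I test would measure (libraries such as `esda`/`libpysal` provide full implementations with significance testing; this minimal NumPy version assumes a binary adjacency matrix `W` as the spatial weights):

```python
import numpy as np

def morans_i(values, W):
    """Global Moran's I for an array of area values and a spatial weights matrix W."""
    z = values - values.mean()          # deviations from the mean
    n = len(values)
    # I = (n / sum of weights) * (weighted cross-products) / (total variance)
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

# Toy example: 4 areas in a line, binary adjacency
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

clustered = np.array([1.0, 1.2, 3.0, 3.1])    # similar neighbors -> positive I
alternating = np.array([1.0, 3.0, 1.0, 3.0])  # dissimilar neighbors -> negative I
print(morans_i(clustered, W), morans_i(alternating, W))
```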
It may be useful to add demographic data (race/ethnicity) or socioeconomic variables (income/education) to this analysis. Other potential metrics include air quality (PM2.5, ozone), industrial density, or housing density. It is possible that another model, such as a spatial linear regression or non-linear regression, may be better suited. There is also an option in scikit-learn for a Box–Cox transformation that may be interesting to explore in future analysis.
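The scikit-learn option mentioned is `sklearn.preprocessing.PowerTransformer` with `method='box-cox'` (it requires strictly positive inputs, which holds for rates and patch sizes here). A minimal sketch on synthetic right-skewed data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed positive data, loosely like the mean patch size distribution
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=3.0, sigma=0.8, size=(500, 1))

# Box-Cox transform; standardize=True also z-scores the output
pt = PowerTransformer(method='box-cox', standardize=True)
transformed = pt.fit_transform(skewed)

print(pt.lambdas_)  # fitted lambda per column
```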
Conclusion¶
In this data analysis, the relationship between urban greenspace and adult high blood pressure was explored across Chicago census tracts. Landscape metrics (edge density, vegetated fraction, and mean patch size) were calculated from NAIP-derived NDVI and mapped at the census tract level. Edge density was normalized by the total number of tract pixels so that tracts of different sizes could be compared. Higher edge density is associated with more fragmented greenspace, while larger mean patch size is associated with more intact greenspace.
A log transformation was applied to reduce skew and better satisfy the assumptions of the OLS linear regression model. The Q-Q plots showed a reasonable fit for OLS, with only a small amount of noise. The scatter plots suggested possible multicollinearity between edge density and vegetated fraction, so one of these variables may need to be removed in future analysis to avoid violating OLS assumptions.
In the final spatial error map (predicted - actual), we observe clustering of both positive and negative values, indicating that some areas are systematically over- or under-predicted. The negative (under-prediction) values visually overlap with Black-majority census tracts, while the positive (over-prediction) values are associated with other tracts. This suggests that the census tracts may not be spatially independent and that other factors, such as race, socioeconomic class, industry, or pollution, may be contributing to HBP. While there was some visual association between higher edge density and higher HBP rates, and between larger mean patch size and lower HBP rates, the relationship was not strictly linear. The spatial error map suggests that more than these landscape metrics alone contributes to high blood pressure. Future analysis could include demographic or socioeconomic variables, better-fitted models, or other data transformations.
Further analysis is needed, but greenspace and public health remain an important area for research and exploration to determine the full implications of the relationship.
Citations¶
Chicago Racial Dot Map 2020. (2024). ArcGIS. https://www.arcgis.com/home/item.html?id=0efb750426dc4d3381f5385b437e1816
Kondo, M. C., Fluehr, J. M., McKeon, T., & Branas, C. C. (2018). Urban Green Space and Its Impact on Human Health. International Journal of Environmental Research and Public Health, 15(3), 445. https://doi.org/10.3390/ijerph15030445
Liu, D., Kwan, M.-P., & Kan, Z. (2021). Analysis of urban green space accessibility and distribution inequity in the City of Chicago. Urban Forestry & Urban Greening, 59, 127029. https://www.sciencedirect.com/science/article/pii/S1618866721000546
Mitchell, K. (2025, April 21). The distribution of parks in Chicago. ArcGIS StoryMaps. https://storymaps.arcgis.com/stories/883b4a42aa9e4cada0c66f53b0211ba5
Schwegman, D. J. (2025). Historical housing discrimination, redlining, and the contemporary distribution of local economic development funding: The case of Chicago. Economic Development Quarterly, 39(2), 75–92. https://doi.org/10.1177/08912424241309321
U.S. Census Bureau. (2024). QuickFacts: Chicago city, Illinois. U.S. Department of Commerce. https://www.census.gov/quickfacts/fact/table/chicagocityillinois/PST045224
Zhao, Y., Bao, W.-W., Yang, B.-Y., Liang, J.-H., Gui, Z.-H., Huang, S., Chen, Y.-C., Dong, G.-H., & Chen, Y.-J. (2022). Association between greenspace and blood pressure: A systematic review and meta-analysis. Science of the Total Environment, 817, 152513. https://doi.org/10.1016/j.scitotenv.2021.152513